Abstract

We consider a social system of interacting heterogeneous agents with learning abilities,
a model close to Random Field Ising Models, where the random field corresponds to the
idiosyncratic willingness to pay. Given a fixed price, agents decide repeatedly whether or not to
buy a unit of a good, so as to maximize their expected utilities. We show that the
equilibrium reached by the system depends on the nature of the information agents use to
estimate their expected utilities.

1 Introduction

Individual decisions in social systems are frequently influenced by the behaviors or choices of
other individuals. Besides the obvious case of fashion [25], many situations of social influence
have been considered and analyzed in the literature. They range from sociological issues like
the decision of attending a bar that may be crowded [2], a seminar that may have vanishing
attendance [36], choosing a movie or a restaurant [3], committing crime [16], to political issues
such as the decision of joining a riot [13], voting for or against a new constitution [2], etc.

The first models, proposed by Schelling [34], were aimed at demonstrating that the collective
outcomes when individuals interact socially with each other may seem paradoxical, that is,
intuitively inconsistent with the intentions of the individuals who generate them. In fact,
the collective states that result from the aggregation of individual decisions that are not voluntarily
coordinated cannot be predicted by any simple counting or extrapolation of the individual
preferences. Schelling [36] built simple models of social paradoxes, like the existence of racial
segregation in urban neighbourhoods despite the non-racist character of the inhabitants, the
death of a weekly seminar for lack of participants despite their interest in it, etc. The reason for
these paradoxes is to be found in the fact that systems with interacting individuals may present
multiple equilibria. These may be analyzed in a natural way in the framework of statistical
physics.

Models of interacting agents facing binary decision problems have been considered within
an economic framework (0 [2J3 [TTJ [T5]) after Föllmer [T3] first used the finite-temperature
Ising model in a homogeneous external field to analyze equilibria in a two-goods market.
In this paper we consider a general model introduced in Gordon et al. [TH] and Nadal et
al. [26], where the interacting agents have different private willingnesses to pay, i.e. different
local fields. The individual utilities are the sum of the private and the interaction terms.
Interactions are assumed to be global and positive, so that utilities increase proportionally to 
the total fraction of buyers. Global interactions are pertinent when the individual utilities 
depend on decisions of remote and probably unknown individuals. This is the case of the 
subscription to a telephone network [3TJ[52], or the choice of a standard [2U], where making the 
same decision as the majority carries advantages. Notice that this kind of aggregated data may 
be easily available through public information. In statistical mechanics, this model belongs to 
the class of mean-field ferromagnetic Random Field Ising Models. 

As shown by Gordon et al. [17], when the social interactions are strong enough, the system
presents multiple (Nash) zero-temperature equilibria. The one that is individually and globally
optimal is called the Pareto-dominant equilibrium in economics. However, in contrast with physical
systems where energy and entropy determine the actual thermodynamic equilibrium, none of the
possible equilibria may be ruled out in social systems. Multiple equilibria pose coordination
dilemmas for the agents. The equilibrium actually reached by the system depends on the
decision-making dynamics.

In game theory, mostly limited to two-player games, it is usually assumed that individuals
possess the skills and the information necessary to analyze the consequences of all the possible
outcomes. They are therefore able to find the optimal decision, and thus realize the
Pareto-optimal equilibrium. However, in situations with large numbers of participants like the
one considered here, or in situations of uncertainty, individuals may be unable to grasp the
information necessary for coordination. In fact, they are more likely to rely on beliefs than
on perfectly rational reasoning to make decisions. Deviations from rationality may
arise not only in situations of limited or incomplete information, but also due to human errors,
different psychological attitudes with respect to risk, etc.

We are interested in situations where agents make their decisions repeatedly. In that case
they may modify their beliefs by learning from past experiences. To this end we assume that
each agent associates an expected surplus or payoff to buying. Once the decisions are made
according to these expectations, the latter are in turn updated based on the grasped information,
using a learning rule. This process is called learning upon experience in the literature. Behavioral
learning is actually the subject of important theoretical studies in different disciplines, in
particular in the context of game theory (see e.g. |U [5TJ [551 13H]) and in 'econophysics' (see
e.g. 0[531[TJ|57]). Quite importantly, increasing access to empirical data allows one to compare
theoretical predictions with observed behaviors [7, 37].

We have studied the equilibria reached by the system for different learning rules proposed in
the literature to explain outcomes in experimental economics. Following Camerer [7], we introduce
a small number of parameters that allow all these rules to be studied within a single framework.
Here we report our most interesting results, obtained through weighted belief learning and
reinforcement learning. In these settings, buyers update their expectations according to their
obtained surplus, while non-buyers use degraded information. We compare results for two
different information conditions. In one of them, which we call $s$-learning, the agents estimate the
expected surpluses based on actual payoffs. In the other, called $\eta$-learning, they estimate them
based on the fraction of individuals they expect will buy. In the latter case, agents are assumed
to know the additive structure of the utility function. Our analysis is limited to populations of
homogeneous learners: all the agents use the same learning rule, although they make different
initial guesses. As a reference for our analysis we use the results obtained with the standard
iterative steepest ascent used to determine the local equilibria in Ising model simulations, where
at each time step individuals make the best decision conditional on the outcome of the previous
period, a dynamics called myopic fictitious play in game theory.

We show that coordination of learners on the optimal (Nash) equilibrium, not only in the
presence of multiple equilibria but even when the equilibrium is unique, is far from being the
norm. In fact, fairly restrictive conditions are needed. The emerging collective state depends
strongly on the values of the learning parameters, and is very sensitive to the agents' initial
beliefs. There are significant differences in the aggregate values obtained through the simulation
of the two learning scenarios. Previous results corresponding to $s$-learning have been reported
elsewhere [35]. The performances along the learning paths as well as the incidence of different
initial conditions on the collective behaviour were thoroughly detailed. In a forthcoming paper
[25] we will present an analytical study of the stationary regime attained through the learning
dynamics and probabilistic decision-making.

The paper is organized as follows: in section 2 we present the agents model and its statistical
mechanics equilibrium properties. In section 3 we describe the learning scenarios. We present
the general settings of our simulations in section 4 and the results in section 5. Section 6
concludes the paper.



2 Model with heterogeneous interacting agents

We consider a social system of $N$ heterogeneous agents ($i = 1, 2, \dots, N$) that must decide either
to buy ($\omega_i = 1$) or not ($\omega_i = 0$) one unit of a single good at an exogenous price $P$. Following
Nadal et al. we assume that each agent $i$ has a willingness to pay $H_i$, which represents
the maximal amount he is ready to pay for the good in the absence of social interactions. The
values $H_i$ are assumed to be randomly distributed among the agents according to a probability
density function of average $H$ and variance $\sigma_H^2$. In addition to this idiosyncratic term, the
decisions of other agents exert an additive social influence on each individual $i$, increasing his
willingness to pay if others buy. This influence is assumed to be proportional to the fraction of
buyers (other than $i$):

$$\eta_i \equiv \frac{1}{N-1} \sum_{k \neq i} \omega_k. \qquad (1)$$

The utility of buying, the surplus, for individual $i$ is:

$$S_i = H_i + J \eta_i - P, \qquad (2)$$

where $J > 0$, the weight of the social influence, is assumed to be the same for all the agents.
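As an illustration of equations (1) and (2), the following minimal Python sketch computes the social influence term and the surplus for a handful of agents. The function names and the numerical values are hypothetical, chosen only for this example, and the normalization by $N - 1$ is an assumption that follows the phrase "fraction of buyers (other than $i$)".

```python
def fraction_of_other_buyers(choices, i):
    """eta_i of equation (1): fraction of buyers among the N - 1 agents other than i.

    Assumes the normalization by N - 1 (buyers "other than i").
    """
    n = len(choices)
    return (sum(choices) - choices[i]) / (n - 1)


def surplus(h_i, big_j, eta_i, price):
    """Surplus of buying, equation (2): S_i = H_i + J * eta_i - P."""
    return h_i + big_j * eta_i - price


# Four hypothetical agents; omega_k = 1 means agent k buys.
choices = [1, 0, 1, 1]
eta_0 = fraction_of_other_buyers(choices, 0)   # two of the three other agents buy
s_0 = surplus(h_i=1.0, big_j=0.9, eta_i=eta_0, price=1.5)
print(eta_0, s_0)
```

In this example the idiosyncratic term alone ($H_0 - P = -0.5$) would not justify buying; the social term $J \eta_0$ is what makes the surplus positive.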

The equilibrium properties of this model have been analyzed using the mean field
approximation, for different distributions of the $H_i$ [18, 30]. More recently, the properties for very
general distributions have been determined [17]. Hereafter we briefly summarize the main
results, which we illustrate for the particular case of the triangular distribution considered in our
simulations. The latter allows for a complete analytical equilibrium study [30].

In the thermodynamic limit $N \to \infty$, $\eta_i$ in (1) and (2) may be approximated by

$$\eta \equiv \frac{1}{N} \sum_{k=1}^{N} \omega_k. \qquad (3)$$

It is useful to introduce the following reduced parameters

$$x_i \equiv \frac{H_i - H}{\sigma_H}, \quad j \equiv \frac{J}{\sigma_H}, \quad \delta \equiv \frac{H - P}{\sigma_H}, \qquad (4)$$

where $\delta$ is the (reduced) gap between the average willingness to pay and the price. $x_i$ represents
the (reduced) idiosyncratic preference of agent $i$. It is a random variable of zero mean and
unitary variance, distributed among the population according to a probability density function
(pdf) $f(x)$.
A rational agent chooses the strategy $\omega_i$ that maximizes his (reduced) surplus

$$s_i \equiv \delta + x_i + j \eta, \qquad (5)$$

that is:

$$\omega_i = \arg\max_{\omega \in \{0,1\}} \omega s_i. \qquad (6)$$

In the thermodynamic limit, the fraction of buyers at equilibrium is equal to the probability of
buying of the population:

$$\eta = \mathcal{P}(\delta + x_i + j \eta > 0) = \mathcal{P}(x_i > -s), \qquad (7)$$

where $s \equiv \delta + j \eta$ is the population's average surplus. Solutions to equation (7) give the fractions
of buyers at equilibrium, which depend on the parameters of the model. The properties of the
system may be summarized on a phase diagram, in which the lines separating different regimes
of Nash equilibria are plotted in the space of the model parameters, namely $j$ and $\delta$. The
main results are that if the (reduced) strength of the social interactions $j$ is larger than a
distribution dependent value $j_B \equiv 1/f_{max}$ ($f_{max}$ is the maximum of the pdf), there is a range
of values of $\delta$ where two different (stable) equilibria coexist: one with a large fraction of buyers,
the efficient Pareto-optimal one where coordination is achieved, and another with a smaller
fraction of buyers. This multiplicity is a generic property of models with social interactions

H3. 

In our simulations, the reduced variables $x_i$ are randomly distributed according to the
following triangular pdf:

$$f(x) = \frac{2 (2b - x)}{9 b^2} \quad \text{if } -b \le x \le 2b, \qquad (8)$$

with $b = \sqrt{2}$. Outside the support $[-b, 2b]$, $f(x) = 0$. The maximum of $f(x)$ is reached at the
left boundary, $f_{max} = f(-b) = 2/(3b)$. The solutions to (7) are straightforward [30], and are
represented as a function of $\delta$ for different values of $j$ in figure 1.
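The triangular pdf (8) is easy to sample by inverting its cumulative distribution, $F(x) = 1 - (2b - x)^2/(9b^2)$. The sketch below is one possible way to draw the $x_i$, not necessarily the procedure used for the simulations reported here; the function names are hypothetical. It also checks numerically that the samples have approximately zero mean and unit variance, as required of the reduced variables.

```python
import math
import random

B = math.sqrt(2.0)  # parameter b of the support [-b, 2b]


def pdf(x):
    """Triangular pdf of equation (8), with its maximum at the left boundary x = -b."""
    if -B <= x <= 2 * B:
        return 2 * (2 * B - x) / (9 * B * B)
    return 0.0


def sample_x(rng):
    """Inverse-CDF sampling: solve u = 1 - (2b - x)^2 / (9 b^2) for x."""
    u = rng.random()                      # uniform in [0, 1)
    return 2 * B - 3 * B * math.sqrt(1.0 - u)


rng = random.Random(0)                    # fixed seed, for reproducibility only
xs = [sample_x(rng) for _ in range(200_000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean, var)                          # both should be close to 0 and 1 respectively
```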




Figure 1: Demand $\eta(j; \delta)$ as a function of $\delta = h - p$ for different values of $j$, for the triangular pdf (8) with maximum
at $x = -b$. Notice that $\delta = 0$ corresponds to $h = p$. Unstable solutions are not shown. B: bifurcation point; U, L:
boundaries of the region with multiple equilibria for $j = 4$.



The critical value of $j$ is $j_B = 1/f_{max} \approx 2.12$. For $j < j_B$, $\eta(j; \delta)$ is a single-valued function
(see figure 1 for the particular value $j = 1 < j_B$). If $\delta < \delta_0 = -2b$, prices are so high with
respect to the average willingness to pay of the population that there are no buyers at all
($\eta = 0$), i.e. there is no market. At the other end, if $\delta > \delta_1(j) = b - j$, prices are so low that
the market saturates ($\eta = 1$). These saturation effects arise because the support of $f$ is finite.
For $\delta_0 < \delta < \delta_1(j)$, $\eta(j; \delta)$ is a monotonically increasing function of $\delta$, obtained by inserting (8) into (7):

$$\eta(j; \delta) = \frac{9 b^2 - 2 j (2b + \delta) - 3b \sqrt{9 b^2 - 4 j (2b + \delta)}}{2 j^2}. \qquad (9)$$

For $j > j_B$ there is a range of values of $\delta$, $\delta_U(j) < \delta < \delta_L(j)$ with $\delta_L(j) = -2b + 9b^2/(4j)$ and
$\delta_U(j) = b - j$, for which there are two solutions that we denote $\eta_U(j; \delta)$ and $\eta_L(j; \delta)$, with
$\eta_U(j; \delta) > \eta_L(j; \delta)$ for all the range of $\delta$ where they coexist (see $\eta(j; \delta)$ in figure 1 for $j = 4 > j_B$).
More precisely, the low-$\eta$ branch, $\eta_L(j; \delta)$, exists for $\delta < \delta_L(j)$. Its dependence on $j$ and
$\delta$ is the same as in equation (9). At $\delta = \delta_L(j)$ it reaches its largest value: $\eta_L(j; \delta_L(j)) \equiv \eta_L(j)$.
The high-$\eta$ branch exists for $\delta > \delta_U(j)$. In our case, it corresponds to saturation ($\eta_U(j; \delta) = 1$).
The Pareto-optimal equilibrium is $\eta_U(j; \delta)$, since it corresponds to the largest utility for all
the buyers, who are in turn more numerous than in the equilibrium $\eta_L(j; \delta)$. However, the
equilibrium actually reached by the system depends on the decision-making process, which we
study in the next section.
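The equilibrium structure described above can be checked numerically. The following Python sketch lists the stable solutions of the mean field equation (7) for the triangular pdf (8). It assumes the closed form that follows from inserting (8) into (7), namely $\eta = (2b + \delta + j\eta)^2/(9b^2)$ on the non-saturated branch, whose smaller root is the stable one; the function name `equilibria` is hypothetical.

```python
import math

B = math.sqrt(2.0)  # parameter b of the triangular pdf


def equilibria(j, delta):
    """Stable equilibrium fractions of buyers for given j and delta.

    Returns a list with one or two values (low branch first).
    """
    u = 2 * B + delta
    sols = []
    if u <= 0:
        # delta <= delta_0 = -2b: nobody has a positive surplus when eta = 0
        sols.append(0.0)
    else:
        disc = 9 * B * B - 4 * j * u
        if disc >= 0:
            # smaller root of j^2 eta^2 + (2 j u - 9 b^2) eta + u^2 = 0
            eta = (9 * B * B - 2 * j * u - 3 * B * math.sqrt(disc)) / (2 * j * j)
            if eta < 1:
                sols.append(eta)
    if delta >= B - j:
        # saturation: even the agent with x = -b buys when eta = 1
        sols.append(1.0)
    return sols


j_b = 3 * B / 2                          # critical interaction strength, about 2.12
delta_l4 = -2 * B + 9 * B * B / (4 * 4)  # upper end of the low branch for j = 4
delta_u4 = B - 4                         # lower end of the saturated branch for j = 4
print(j_b, delta_u4, delta_l4)
print(equilibria(1.0, -0.5))             # single equilibrium for j = 1
print(equilibria(4.0, -2.0))             # two equilibria inside the coexistence region
```

For $j = 4$ and $\delta = -2$ the function returns a low-$\eta$ solution together with the saturated one, which is the coexistence described in the text.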

These results are summarized in the phase diagram of figure 2, where the saturation lines
and the parameter region with two solutions (grey area) are represented.

Figure 2: Customers phase diagram for the triangular pdf (8). Grey region: coexistence of two equilibria. $\eta = 0$ is an
equilibrium within the obliquely hatched region ($\delta < \delta_0$); $\eta = 1$ (saturation) is an equilibrium within the horizontally
hatched region. Points (a) to (f) refer to the parameters considered in section 5.



3 Learning dynamics 

We are interested in the equilibria reached by the system when the customers make their 
decisions repeatedly, at successive periods, based on information grasped from their past actions. 
We assume that at each period the agents do not know a priori the payoffs corresponding to 
each possible strategy. They rely on their own beliefs or estimations to make their decisions. 

1 We use the convention that the first terms in parentheses are parameters, and the term after the semicolon
is the variable.

2 Notice that $\delta_U(j) = \delta_1(j)$ is a degeneracy due to the fact that the pdf reaches its maximum at a boundary
of the support.

In the present case of binary decisions it is sufficient to estimate the difference between the
payoffs expected upon buying with respect to not buying. Thus, individuals need to estimate a
single value, which we hereafter call attraction for buying, or simply attraction, following Camerer [7].

We consider two different learning scenarios which differ in the kind of information assumed
to be available to the customers. In the first one, which we call $s$-learning, customers do not
know the parameters nor the structure of the surplus function on which they have to form
expectations. They make direct estimations of the payoffs expected upon buying (in our model
the expected payoff for not buying vanishes). Starting with some initial beliefs $a_i(0)$, at each
iteration $t$ individuals $i \in \{1, \dots, N\}$ make their decisions $\omega_i(t)$ for the period according to
the attractions $a_i(t)$ and then update the latter based on the obtained payoffs $s_i(t)$. In the
second scenario, hereafter called $\eta$-learning, each agent is assumed to know the gap between
his idiosyncratic willingness to pay and the price ($\delta + x_i$) as well as the strength of the social
interactions $j$. He only needs to estimate the expected fraction of buyers $\hat{\eta}_i(t)$, in order to
determine his attraction for buying $a_i(t) = \delta + x_i + j \hat{\eta}_i(t)$.

The system is updated iteratively: at each period $t$ each agent $i$ chooses a strategy $\omega_i(t)$
based on his attraction $a_i(t)$. This choice may be probabilistic, but here we concentrate on
a deterministic decision-making process. Once decisions are made, attractions are updated
using the grasped information. More precisely, the system evolves according to the following
two-step dynamics:

Making decisions: each individual makes the choice that maximizes his expected payoff.
Thus, if the attraction for buying is positive, the choice is $\omega_i(t) = 1$, otherwise $\omega_i(t) = 0$. This
is called myopic best response in the literature. Since attractions are estimated payoffs,

$$\omega_i(t) = \Theta(a_i(t)), \qquad (10)$$

where $\Theta(x)$ is the Heaviside function ($\Theta(x) = 1$ if $x > 0$, $\Theta(x) = 0$ otherwise). Notice that this
deterministic decision rule depends only on the sign of the attraction but not on its magnitude.
The surplus or earned payoff is then

$$s_i(t) = \omega_i(t) (\delta + x_i + j \eta(t)), \qquad (11)$$

where $\eta(t)$ is the actual fraction of buyers of the period. Since attractions may be inaccurate or
erroneous estimations of the actual payoffs, the agents may make bad decisions and either get negative
payoffs or miss positive ones.



Updating attractions: let $z_i$ be the quantity on which the individuals make estimations ($s_i$ or
$\eta_i$, depending on the learning scenario). Individual $i$ updates $\hat{z}_i(t)$, the estimation at time $t$,
using the information obtained as a result of his decision $\omega_i(t)$. The updating rules considered
hereafter have the following structure:

$$\hat{z}_i(t+1) = (1 - \mu) \hat{z}_i(t) + \mu [A + (1 - A) \omega_i(t)] z_i(t), \qquad (12)$$

where $0 < \mu \le 1$ is the learning rate and $A$ is a parameter ($0 \le A \le 1$) that allows
$\hat{z}_i$ to be updated differently depending on the period's decision $\omega_i(t)$. Notice that $z_i(t)$ on the right
hand side of (12) is the actual value of $z$ after the decision $\omega_i(t)$ of period $t$ is made and
the corresponding payoff (if any) is earned. In particular, the learning rule obtained by setting
$A = 1$ in (12) is known in the literature as fictitious play [71(5]: whatever the decision $\omega_i(t)$, the
value $\hat{z}_i(t+1)$ is updated using the actual value $z_i(t)$. If $A = 0$, the rule (12) gives rise to
the usual reinforcement learning [TH], in which the estimated quantity is updated with the actual value only if
$\omega_i(t) = 1$. Another well known rule, the standard Cournot best reply [10], in which only the
previous period counts, is obtained by putting $\mu = 1$ and $A = 1$ in (12). The latter corresponds
to a standard parallel steepest ascent search of the (possibly local) optimum.
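The updating structure (12) and its special cases can be written in a few lines. The sketch below is only an illustration: the function name and argument names are hypothetical, `mu` stands for the learning rate and `A` for the weight that non-buyers give to the actual value.

```python
def update_estimate(z_hat, z_actual, bought, mu, A):
    """One application of the updating structure (12).

    z_hat    : current estimate of the quantity (surplus or fraction of buyers)
    z_actual : actual value observed at the end of the period
    bought   : 1 if the agent bought during the period, 0 otherwise
    mu       : learning rate, 0 < mu <= 1
    A        : weight given to the actual value by non-buyers, 0 <= A <= 1
    """
    weight = A + (1.0 - A) * bought
    return (1.0 - mu) * z_hat + mu * weight * z_actual


# Cournot best reply (mu = 1, A = 1): the estimate is replaced by the last actual value.
cournot = update_estimate(0.2, 0.8, bought=0, mu=1.0, A=1.0)

# Reinforcement learning (A = 0): a non-buyer only shrinks his old estimate by 1 - mu.
rl_non_buyer = update_estimate(-0.4, 0.8, bought=0, mu=0.5, A=0.0)

# Reinforcement learning (A = 0): a buyer moves towards the actual value.
rl_buyer = update_estimate(0.2, 0.8, bought=1, mu=0.5, A=0.0)
print(cournot, rl_non_buyer, rl_buyer)
```

With $A = 1$ the factor in brackets equals 1 whatever the decision, which is why fictitious play does not distinguish buyers from non-buyers.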

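In the $s$-learning scenario, introducing for $z_i(t)$ the payoff for buying, $\delta + x_i + j \eta(t)$ (the term
in parentheses in equation (11), forgone by non-buyers), and $\hat{z}_i = a_i$ in equation (12) gives the time evolution of the attraction:

$$a_i(t+1) = (1 - \mu) a_i(t) + \mu [A + (1 - A) \omega_i(t)] (\delta + x_i + j \eta(t)). \qquad (13)$$

In the case of $\eta$-learning, $z_i = \eta_i$ (approximated by $\eta$, equation (3)) and $\hat{z}_i = \hat{\eta}_i$, so that, after introduction into equation (12)
and some algebra using $a_i(t) = \delta + x_i + j \hat{\eta}_i(t)$, the evolution of the corresponding attraction is:

$$a_i(t+1) = (1 - \mu) a_i(t) + \mu (\delta + x_i + j [A + (1 - A) \omega_i(t)] \eta(t)). \qquad (14)$$

Both rules coincide within the fictitious play paradigm, i.e. for $A = 1$.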
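The difference between the two scenarios is easiest to see on a single update step. The following sketch assumes that in $s$-learning the factor $A + (1 - A)\omega_i(t)$ multiplies the whole payoff for buying, $\delta + x_i + j\eta(t)$, whereas in $\eta$-learning it multiplies only the social term $j\eta(t)$; the function names and numerical values are hypothetical.

```python
def s_learning_step(a, x, eta, bought, delta, j, mu, A):
    """Attraction update when agents estimate payoffs directly (s-learning)."""
    payoff_for_buying = delta + x + j * eta
    return (1 - mu) * a + mu * (A + (1 - A) * bought) * payoff_for_buying


def eta_learning_step(a, x, eta, bought, delta, j, mu, A):
    """Attraction update when agents estimate the fraction of buyers (eta-learning)."""
    return (1 - mu) * a + mu * (delta + x + j * (A + (1 - A) * bought) * eta)


# A hypothetical non-buyer (negative attraction) in a period where half the agents buy.
args = dict(a=-0.3, x=-1.0, eta=0.5, bought=0, delta=0.2, j=1.0, mu=0.5)

same_s = s_learning_step(A=1.0, **args)       # fictitious play
same_eta = eta_learning_step(A=1.0, **args)   # identical to the line above
diff_s = s_learning_step(A=0.5, **args)       # whole payoff weighted by A
diff_eta = eta_learning_step(A=0.5, **args)   # only the social term weighted by A
print(same_s, same_eta, diff_s, diff_eta)
```

For a buyer (`bought=1`) the two functions return the same value for any $A$; the scenarios differ only in how non-buyers treat the information they did not experience first hand.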



4 General simulation settings

In this section we present the common general settings of our simulations. Results obtained
with the two different learning scenarios presented in the preceding section, namely $s$-learning
and $\eta$-learning, are discussed in the next section.

4.1 System parameters

Simulations were done for different values of $\delta$, defined by (4). The values of $x_i$, the (centered)
idiosyncratic component of the willingness to pay, are drawn according to the triangular pdf
(8). Since this pdf is a decreasing function of $x_i$, there are fewer individuals with high than
with low values of $x_i$. As a consequence, our histograms of final states as a function of $x_i$ have
better statistics for low values than for large values of $x_i$.

We focus on the learning behavior for two values of the social influence weight $j$, one below,
the other above, the critical value $j_B = 3b/2 \approx 2.12$ (see section 2). These are $j = 1$, which has
a single equilibrium for any value of $\delta$, and $j = 4$, which may present two possible equilibria
for the range $\delta_U(j) < \delta < \delta_L(j)$ with $\delta_U(4) = -2.5858$ and $\delta_L(4) = -1.7034$. At equilibrium,
due to the boundedness of the support of the idiosyncratic willingness to pay (IWP), $\eta = 0$ below $\delta_0 \approx -2.83$. For $j = 1$ we have
$\eta = 1$ above $\delta_1(1) \approx 0.41$, whereas for $j = 4$ saturation ($\eta = 1$) is a possible equilibrium for
$\delta > \delta_U(4)$.

All the presented simulations correspond to systems with $N = 1000$ agents, averaged over
100 systems, i.e. corresponding to 100 different realizations of the random idiosyncratic
willingnesses to pay (IWP). We present results corresponding to synchronous (parallel) updating,
where the procedure detailed in the preceding section is iterated until convergence. Results with
sequential asynchronous dynamics [35] only differ in the time needed to converge, the reached
equilibria being similar.

We performed thorough simulations, obtaining statistics of learning times, cumulated payoffs,
etc. In this article we describe the most interesting results, which are the fractions of
buyers and the distribution of attractions at convergence, because they allow one to understand the
differences between the different types of learning schemes.

4.2 Initial states 

We assume that the agents start with some initial values of their attractions, which represent
their a priori beliefs. Among the different possible ways of defining the initial beliefs, we analyze
systematically three different initializations:

optimistic: in $s$-learning, the initial attractions $a_i(0)$ are randomly selected positive numbers
in the interval [0, 1] for all $1 \le i \le N$, so that the very first decision for all the agents is
to buy. In $\eta$-learning, the initial values are $\hat{\eta}_i(0) = 1$ for all $1 \le i \le N$, so that the initial
attractions are $a_i(0) = x_i + \delta + j$. Notice that in this case, the decisions of agents with
$x_i < -j - \delta$ are not strategic: they will choose not to buy in the first period despite their
optimistic guess on $\hat{\eta}_i(0)$, because their IWP is too small.

pessimistic: in $s$-learning, the initial attractions $a_i(0)$ are randomly selected negative numbers
in the interval [-1, 0] for all $1 \le i \le N$. At the first iteration, no agent buys.
In $\eta$-learning, the initial values are $\hat{\eta}_i(0) = 0$ for all $1 \le i \le N$. Here, the choices of
individuals with $x_i > -\delta$ are not strategic because their payoffs upon buying are positive
independently of the choices of the other agents. Thus, their first period choice is
$\omega_i(0) = 1$.

random: in $s$-learning, the initial attractions $a_i(0)$ are randomly selected numbers in the
interval [-1, 1] for all $1 \le i \le N$. In $\eta$-learning, the initial values $\hat{\eta}_i(0)$ are random
numbers in [0, 1] for all $1 \le i \le N$. Here, agents with $x_i + \delta > 0$ (resp. $x_i + \delta + j < 0$)
buy (resp. do not buy) whatever their individual estimations $\hat{\eta}_i(0)$. Those with
$-j < x_i + \delta < 0$, i.e. those whose decision is actually dependent on the collective outcome,
will buy only if $\hat{\eta}_i(0) > -(\delta + x_i)/j$.

The first two initializations correspond to extreme cases. They lead to equilibria that are
respectively upper and lower bounds to the fractions of buyers at equilibrium reached with
other initializations.
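The three initializations can be sketched as a single helper. This is an illustrative reading of the list above rather than the code used for the reported simulations; the function name, the string labels and the sample values of $x_i$ are hypothetical.

```python
import random


def initial_attractions(kind, scenario, xs, delta, j, rng):
    """Return the list of initial attractions a_i(0).

    kind     : "optimistic", "pessimistic" or "random"
    scenario : "s" for s-learning, "eta" for eta-learning
    """
    attractions = []
    for x in xs:
        if scenario == "s":
            if kind == "optimistic":
                a0 = 1.0 - rng.random()          # in (0, 1]: everybody buys first
            elif kind == "pessimistic":
                a0 = -rng.random()               # in (-1, 0]: nobody buys first
            else:
                a0 = rng.uniform(-1.0, 1.0)
        else:
            if kind == "optimistic":
                eta_hat0 = 1.0
            elif kind == "pessimistic":
                eta_hat0 = 0.0
            else:
                eta_hat0 = rng.random()
            a0 = delta + x + j * eta_hat0        # a_i(0) = delta + x_i + j * eta_hat_i(0)
        attractions.append(a0)
    return attractions


rng = random.Random(1)
xs = [-1.2, -0.3, 0.4, 2.1]                      # hypothetical reduced willingnesses to pay
opt_s = initial_attractions("optimistic", "s", xs, delta=-2.0, j=4.0, rng=rng)
pes_eta = initial_attractions("pessimistic", "eta", xs, delta=-2.0, j=4.0, rng=rng)
first_choices = [1 if a > 0 else 0 for a in pes_eta]
print(opt_s, pes_eta, first_choices)
```

In the pessimistic $\eta$-learning case with $\delta = -2$, only the agent with $x_i > -\delta = 2$ buys in the first period, which illustrates the non-strategic choices mentioned above.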



5 Simulation results

We first present results obtained with myopic fictitious play, for which both learning scenarios
coincide. This corresponds to the usual dynamics used in spin systems, in which at each iteration
spins are aligned with their local fields. Since the interactions between agents are symmetric,
the system has an underlying energy function. Thus, the dynamics has fixed point attractors,
which are the equilibrium states presented in section 2. These results will serve as reference
against which we compare the results of weighted belief and reinforcement learning.



5.1 Myopic fictitious play 



This dynamics is achieved by putting $A = 1$ and $\mu = 1$ in equation (12). It is called myopic
because it is a response to the previous time step only: agents completely disregard older
experiences and do not try to make elaborate expectations on future outcomes. It is called fictitious because
agents are assumed to have knowledge of the values ($s$ or $\eta$) used to build their attractions
independently of whether they buy or not.

The fractions of buyers $\eta$ at equilibrium, obtained for different values of $\delta$, are presented
in figure 3. Symbols correspond to simulations, the lines being the solutions to the mean field
equation (7) with the triangular IWP distribution (8), represented in figure 1. In the range
$0 < \eta < 1$ (excluding $\eta = 0$ and $\eta = 1$), these solutions are given by equation (9).

Figure 3 (left) displays results for $j = 1 < j_B$. This dynamics corresponds to steepest
ascent in the state space, so that the system reaches the fixed point closest to the initial state.
Since for $j = 1$ there is only one fixed point for each value of $\delta$, the system converges to it
independently of the initialization. For $\delta_0 < \delta < \delta_1(j)$, $\eta$ is the fraction of agents that satisfy
$x_i + \delta + j \eta > 0$. In the region $\delta > \delta_1(j)$, these are all the agents. If $\delta < \delta_0$, no agent has an
IWP large enough to get a positive payoff, and the equilibrium is $\eta = 0$.

Figure 3: Myopic fictitious play ($\mu = 1$, $A = 1$). $\eta$ at equilibrium versus $\delta = h - p$ for $j = 1$ and $j = 4$. In these
and all the following figures, when not visible, error bars are smaller than the symbols' sizes. The full lines are the
analytical predictions for the rational (Nash) equilibria. Numerical values of $\delta_0$, $\delta_1(j)$, $\delta_U(j)$ and $\delta_L(j)$ are given in
section 4.1.



For $j > j_B$ and $\delta_U(j) < \delta < \delta_L(j)$ we expect, based on the phase diagram, that different
initializations lead the system to different equilibria. Indeed, the optimistic (pessimistic)
initialization systematically drives the system to the high-$\eta$ (low-$\eta$) equilibrium. Systems with
random initialization end up at either of the two equilibria, depending on the precise
configuration of initial states (see figure 3, right). Actually, with this initialization the distribution of
$\eta$ is bimodal; this is why the averages in the coexistence region present larger variances than
elsewhere. With different initial fractions of buyers, the number of simulated systems that end
up at each attractor differs.
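The dependence on the initial state under myopic fictitious play can be reproduced with a short simulation. The sketch below is a simplified illustration, not the code behind figure 3: it uses the global fraction of buyers for every agent (the large-$N$ approximation (3)), a single realization of the $x_i$ with a hypothetical system size and seed, and it starts from a common initial expectation `eta0` (1 for the optimistic case, 0 for the pessimistic one).

```python
import math
import random

B = math.sqrt(2.0)


def sample_x(rng):
    """Draw x_i from the triangular pdf (8) by inverse-CDF sampling."""
    return 2 * B - 3 * B * math.sqrt(1.0 - rng.random())


def myopic_fictitious_play(xs, delta, j, eta0, max_steps=1000):
    """Iterate omega_i = Theta(delta + x_i + j * eta) in parallel until eta stops changing."""
    eta = eta0
    for _ in range(max_steps):
        choices = [1 if delta + x + j * eta > 0 else 0 for x in xs]
        new_eta = sum(choices) / len(xs)
        if new_eta == eta:
            break
        eta = new_eta
    return eta


rng = random.Random(0)
xs = [sample_x(rng) for _ in range(5000)]

# j = 4, delta = -2 lies inside the coexistence region: the outcome depends on eta0.
eta_optimistic = myopic_fictitious_play(xs, delta=-2.0, j=4.0, eta0=1.0)
eta_pessimistic = myopic_fictitious_play(xs, delta=-2.0, j=4.0, eta0=0.0)

# j = 1, delta = -0.5 has a single equilibrium: both starting points agree (up to sampling noise).
eta_single_hi = myopic_fictitious_play(xs, delta=-0.5, j=1.0, eta0=1.0)
eta_single_lo = myopic_fictitious_play(xs, delta=-0.5, j=1.0, eta0=0.0)
print(eta_optimistic, eta_pessimistic, eta_single_hi, eta_single_lo)
```

In the coexistence region the optimistic start stays at saturation while the pessimistic start settles near the low-$\eta$ solution of the mean field equation; for $j = 1$ both starts end near the unique equilibrium.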




Figure 4: Myopic fictitious play ($\mu = 1$, $A = 1$). Attractions at convergence for three different values of $\delta$, as a
function of $x_i$. For $j = 1$, the fractions of buyers are $\eta(\delta = 1.1) = 1$, $\eta(\delta = -0.5) = 0.403$ and $\eta(\delta = -3.1) = 0$. For
$j = 4$, $\eta(\delta = -1.5) = 1$ and $\eta(\delta = -3) = 0$. For $\delta = -2$ attractions converge to two different fixed points ($\eta = 0.07$
and $\eta = 1$), depending on the initial condition. The characters in parentheses refer to the points in the phase diagram
(figure 2).

The stationary distribution of the individual attractions $a_i$ of a single typical system is
plotted in figure 4 against the idiosyncratic terms $x_i$, for different values of $\delta$ (they correspond
to the equilibrium states (a)-(f) in the phase diagram of figure 2). As expected, the $a_i$ are the actual
payoffs at equilibrium, which increase linearly with $\eta$. The slope of $a_i$ vs. $x_i$ is 1, as it should be,
since $a_i = \delta + x_i + j \eta$, the ordinate at the origin being $\delta + j \eta$.

Results for $j = 4$ and $\delta = -2$ (point (e) in the phase diagram, inside the coexistence region
$\delta_U(4) < \delta < \delta_L(4)$) show the two possible outcomes, obtained through different initializations,
corresponding to the two possible fixed points.



5.2 Weighted belief learning 



In the weighted belief learning scenario, the information grasped by buyers has a larger weight
than that of non-buyers. This scenario aims at modeling situations where buyers have
first-hand knowledge of the quantities they try to estimate (payoffs or fraction of buyers) whereas
individuals that do not take the risk of buying have less reliable information. In equation (12)
this is achieved whenever $0 < A < 1$.

Figure 5: Weighted belief learning ($\mu = 0.5$, $A = 0.5$). $\eta$ at equilibrium versus $\delta$ for $j = 1$ and $j = 4$, for $s$-learning
(above) and $\eta$-learning (below).



The fractions of buyers $\eta$ at convergence with $A = 0.5$ are plotted in figure 5. With
$s$-learning, both for $j < j_B$ and $j > j_B$, equilibria are similar to those with myopic fictitious play,
independently of the type of initialization, although we show later that the learned attractions
are quite different. In contrast, the states reached with $\eta$-learning crucially depend on $A$
being smaller than 1. In fact, only with the optimistic initialization may the agents reach
coordination on the optimal equilibrium (if it exists). With the other two initializations
non-buyers systematically underestimate the social effects by a factor $A$. As a result, the $a_i$'s are
underestimated and the collective outcomes at equilibrium are not consistent with the phase
diagram. At convergence $\eta$ is smaller than the optimal value for a large range of $\delta$ values (see
figure 5). This decrease in $\eta$ is most dramatic with the pessimistic initialization. Since with
the random initialization there are more buyers than with the pessimistic initialization from the
beginning, more individuals can correctly estimate their surpluses, and the collective state at
equilibrium has systematically a larger $\eta$ than when starting with the pessimistic initialization.

Figure 6: Weighted belief learning ($\mu = 0.5$): dependence of $\eta$-learning on $A$: $\eta$ at equilibrium versus $\delta$ for $j = 1$
and $j = 4$ with $A = 0.2$ (above) and $A = 0.8$ (below).



For smaller values of $A$, the misestimations of the social terms are even more conspicuous,
leading the population to inefficient equilibria with low fractions of buyers for a larger range of
$\delta$ values (see figure 6, $A = 0.2$). Conversely, when $A$ is larger, the actual and the estimated $\eta$
are closer to each other, giving results closer to those of fictitious play (see figure 6, $A = 0.8$).
In the limit $A \to 1$ we obtain the results of section 5.1. The case $A = 0$ is considered in the
next section.

The individual attractions at convergence of a representative system are represented against 
the individual idiosyncratic terms Xi on figures [7] for j ' = 1 and j ' = 4 and for different values of 
S. In contrast with myopic fictitious play, the slope of the attractions obtained with s-learning 
depends on whether individuals are buyers or not: for non-buyers the slope is A whereas it is 
1 for buyers, as may be seen on the upper figures [7] 



In the case of 77-learning it is clear from the updating rule (14) that the attractions as a 
function of a; i have aslope 1. However, because A < 1 , both with the pessimistic and the random 
initializations the fractions of non-buyers when j = 1 for 5 in the region 5\(j) < S < b — jA (see 
figure [5]) do not reach the saturation level expected from the phase diagram. The non-buyers 
are agents whose initial estimations r)i(O) determined negative attractions. When the correcting 
term jArj does not allow to compensate a negative value of 5 + x%, these agents persist in non 
buying. 

When $j = 4$, for $\delta > \delta_L(j)$ there is a fraction of the population that does not buy, for the same
reason as for $j < j_B$. This is why for $\delta = -1.5 > \delta_L(4)$, where saturation is expected on the basis
of the phase diagram, we obtain $\eta < 1$ with either random or pessimistic initializations. As
for $j = 1$, here also saturation is reached independently of the initial state only for $\delta > b - j\lambda$.
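
The bound $\delta > b - j\lambda$ can be read off from the asymptotic attraction of a non-buyer. The lines below are only a sketch, under two assumptions that are not restated in this section: that the idiosyncratic terms satisfy $x_i \ge -b$, and that a non-buyer's attraction converges to $x_i + \delta + j\lambda\eta$.

```latex
\begin{align*}
a_i^{\text{non-buyer}} &= x_i + \delta + j\lambda\eta \\
\text{near saturation } (\eta = 1):\quad a_i^{\text{non-buyer}} &= x_i + \delta + j\lambda \\
\delta > b - j\lambda \ \text{ and } \ x_i \ge -b
  \quad&\Longrightarrow\quad x_i + \delta + j\lambda > 0 \ \text{ for all } i
\end{align*}
```

Below this bound, the most reluctant agents (those with $x_i$ close to $-b$) can keep a negative attraction even when everybody else buys, so a state with $\eta < 1$ may persist if they start as non-buyers. Setting $\lambda = 0$ in the same argument gives $\delta > b$.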
$\eta(j = 1, \delta = -0.5) = 0.382$ and $\eta(j = 1, \delta = -3.1) = 0$. For $j = 4$, when $\delta = -2$ attractions converge to two different
fixed points (with $\eta = 0.03$ and $\eta = 1$, depending on the initial condition), whereas $\eta(j = 4, \delta = -1.5) = 0.62$ and
$\eta(j = 4, \delta = -3) = 0$.



5.3 Reinforcement learning 

In reinforcement learning, only agents that buy are assumed to have the information necessary
to estimate their attractions. In equation (12) this is achieved with $\lambda = 0$. This learning
paradigm is also called stimulus-response or rote learning in behavioral psychology [6]. It aims
at modeling risk-averse individuals that refrain from buying from the start, independently of
the posted price. As we see in the following, such behaviours may hinder the emergence of the
Pareto-optimal equilibrium, where the payoffs are optimal for all the agents, for a large range
of values of $\delta$.

As in weighted belief learning, the system's behavior with reinforcement learning strongly
depends on the initial states. In fact, only individuals with $a_i(0) > 0$ can actually learn from
experience because their first decision is to buy. Therefore, the attractions of buyers (but only
these) converge to the actual payoffs $s_i = x_i + \delta + j\eta$, both with $\eta$- and $s$-learning. Their
values of $a_i$ at convergence present a slope 1 as a function of $x_i$. With $s$-learning, non-buyers
(whose attractions are negative) cannot use the information carried by the forgone payoffs.
At each step of the learning process these agents decrease the absolute values of $a_i$ by a
factor $1 - \mu$. Attractions thus keep their negative signs: the corresponding individuals
persist in state $\omega_i = 0$ and the attractions of non-buyers converge to $a_i = 0$ whatever the value
of $x_i$. In figure 8 the corresponding $a_i$ vs $x_i$ present a vanishing slope, and there are individuals
with $a_i = 0$ evenly distributed over the $x_i$ axis. Notice that even when $\delta$ is large enough (low



[Figure panels: $s$-learning and $\eta$-learning, $j = 4$ ($\mu = 0.5$, $\lambda = 0$), random and optimistic initializations.]




Figure 8: Reinforcement learning ($\mu = 0.5$, $\lambda = 0$). Attractions at convergence for three values of $\delta$, as a function
of $x_i$. $s$-learning (above): the fractions of buyers for $j = 1$ are $\eta(j = 1, \delta = 1.1) = 0.52$, $\eta(j = 1, \delta = -0.5) = 0.203$
and $\eta(j = 1, \delta = -3.1) = 0$. For $j = 4$, $\delta = -2$, attractions converge to two different fixed points (with $\eta = 0.022$ and
$\eta = 1$, depending on the initial condition), whereas $\eta(j = 4, \delta = -1.5) = 0.095$ and $\eta(j = 4, \delta = -3) = 0$. $\eta$-learning
(below): $\eta(j = 1, \delta = 1.1) = 0.985$, $\eta(j = 1, \delta = -0.5) = 0.374$ and $\eta(j = 1, \delta = -3.1) = 0$. For $j = 4$, $\delta = -2$,
attractions converge to two different fixed points ($\eta = 0.03$ and $\eta = 1$, depending on the initial condition), whereas
$\eta(j = 4, \delta = -1.5) = 0.294$ and $\eta(j = 4, \delta = -3) = 0$.



enough price) to allow everybody to get positive payoffs, at equilibrium there remain non-buyers
with vanishing attractions independently of their value of $x_i$. In figure 9 the fraction of buyers
$\eta$ (with random initialization) is seen to be systematically smaller than the fraction expected
from the phase diagram. Since in our random initialization setting the initial values $a_i(0)$ are
selected with equal probabilities of being positive or negative, the initial fraction of buyers is
$\eta(0) = 0.5$. Since those who begin with $a_i(0) < 0$ are unable to change their mind, the upper
bound to $\eta$ is 1/2, as is seen in the upper panels of figure 9. For the same reasons, with the pessimistic
initialization nobody buys, independently of $\delta$. Only with the optimistic initialization can all the
individuals learn and make correct estimations of their payoffs: the corresponding curves
$\eta$ vs. $\delta$ are similar to those with myopic best response.
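
The dependence on the initialization can be reproduced with a few lines of simulation. The sketch below is not the authors' code: the update rule `a <- (1 - mu) * a + mu * (lam + (1 - lam) * omega) * s` is an assumed form consistent with the behaviour described above, and the uniform distribution of $x_i$ on $[-2, 2]$, the population size and the initial values are hypothetical choices made only for illustration.

```python
import numpy as np

def simulate_s_learning(x, a0, delta, j, mu=0.5, lam=0.0, steps=200):
    """Iterate the assumed s-learning rule; return (attractions, fraction of buyers)."""
    a = np.asarray(a0, dtype=float).copy()
    for _ in range(steps):
        omega = (a > 0).astype(float)      # 1 for buyers, 0 for non-buyers
        eta = omega.mean()                 # current fraction of buyers
        s = x + delta + j * eta            # actual (or forgone) surplus
        # Buyers learn from s; non-buyers weight it by lam (lam = 0: no learning).
        a = (1.0 - mu) * a + mu * (lam + (1.0 - lam) * omega) * s
    return a, float((a > 0).mean())

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-2.0, 2.0, size=n)         # hypothetical distribution of x_i
inits = {
    "random": rng.uniform(-1.0, 1.0, size=n),   # signs equally likely
    "pessimistic": np.full(n, -1.0),            # everybody starts as non-buyer
    "optimistic": np.full(n, 1.0),              # everybody starts as buyer
}
# delta = 3 is large enough for every agent to have a positive surplus here.
for name, a0 in inits.items():
    _, eta = simulate_s_learning(x, a0, delta=3.0, j=1.0)
    print(f"{name:12s} initial eta = {(a0 > 0).mean():.3f}  final eta = {eta:.3f}")
```

With these settings every agent would obtain a positive surplus by buying, yet under the assumed rule only the optimistic initialization reaches saturation: agents that start with a negative attraction never revise their choice.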

With $\eta$-learning the behavior is closer to that of weighted belief learning: the curves $\eta(j; \delta)$
follow the same trends as in figure 6. The attractions of buyers (non-buyers) converge to
$a_i = \delta + x_i + j\eta$ ($a_i = \delta + x_i$). The range of $x_i$ values where individuals may have different
$a_i$ even if they have similar $x_i$ is obtained as with weighted belief learning, putting $\lambda = 0$
in the equations. This gives $-\delta - j\eta < x_i < -\delta$. This is illustrated in figure 8 (below), and
explains why here also the values of $\eta$ at convergence are systematically smaller than (or equal to)
those expected at the Nash equilibrium, as may be seen in figure 9 (below). Notice that with
$\eta$-learning, even if non-buyers do not use the information about $\eta$, they may still buy provided
that $x_i + \delta > 0$, independently of the initial guess $\eta_i(0)$. This is why $\eta$ may be larger than with
$s$-learning at convergence, see figure (below), and even reach saturation provided $\delta$ is large
enough. The same argument as in the preceding section shows that saturation can be reached
only if $\delta > b$.
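
A minimal simulation sketch of this case is given below. It assumes that each agent keeps an estimate `eta_i` of the fraction of buyers, updated as `eta_i <- (1 - mu) * eta_i + mu * (lam + (1 - lam) * omega) * eta`, with attraction `x + delta + j * eta_i`; this form is inferred from the asymptotic values quoted in the text and is not copied from equation (14). The uniform distribution of $x_i$ on $[-b, b]$ with $b = 2$ is likewise a hypothetical choice.

```python
import numpy as np

def simulate_eta_learning(x, eta0, delta, j, mu=0.5, lam=0.0, steps=200):
    """Iterate the assumed eta-learning rule; return (attractions, fraction of buyers)."""
    eta_i = np.asarray(eta0, dtype=float).copy()
    for _ in range(steps):
        a = x + delta + j * eta_i          # attraction built from the estimate eta_i
        omega = (a > 0).astype(float)      # 1 for buyers, 0 for non-buyers
        eta = omega.mean()                 # actual fraction of buyers
        # Buyers track eta; with lam = 0 the estimate of non-buyers decays to 0.
        eta_i = (1.0 - mu) * eta_i + mu * (lam + (1.0 - lam) * omega) * eta
    a = x + delta + j * eta_i
    return a, float((a > 0).mean())

rng = np.random.default_rng(1)
n, b = 2000, 2.0
x = rng.uniform(-b, b, size=n)             # hypothetical distribution of x_i
pessimistic = np.zeros(n)                  # initial guess eta_i(0) = 0 for everybody

for delta in (-0.5, 1.0, 2.5):
    a, eta = simulate_eta_learning(x, pessimistic, delta=delta, j=1.0)
    always_buy = float((x + delta > 0).mean())
    print(f"delta = {delta:+.1f}  final eta = {eta:.3f}  fraction with x + delta > 0 = {always_buy:.3f}")
```

In this sketch, agents with $x_i + \delta > 0$ buy whatever their initial guess, so the fraction of buyers grows with $\delta$ and reaches 1 once $\delta > b$, unlike the $s$-learning case where it is capped by the initial condition.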



6 Discussion and Conclusion 

It is interesting to compare the results obtained with weighted belief learning to those with
reinforcement learning. In both cases, the equilibrium values of the attractions may be calculated
by replacing $a_i(t)$, $\eta(t)$, $\omega_i(t)$ in (13) and (14) by their asymptotic values $a_i$, $\eta$ and $\omega_i$. With
$s$-learning these are $a_i = [\lambda + (1 - \lambda)\omega_i] s_i$: the attractions of buyers converge to $a_i = s_i$, i.e.
they estimate correctly their expected surplus. Non-buyers estimate $a_i = \lambda s_i$. With $\eta$-learning
we have $a_i = x_i + \delta + j\eta[\lambda + (1 - \lambda)\omega_i]$. Thus buyers also correctly estimate their expected
surplus, and non-buyers underestimate it, since their attractions converge to $a_i = x_i + \delta + j\lambda\eta$.
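
As a check of these asymptotic values, the fixed points can be derived in a few lines. The update rules written below are an assumed form of (13) and (14), reconstructed from the asymptotic expressions quoted above rather than copied from those equations; $\eta_i$ denotes agent $i$'s estimate of the fraction of buyers.

```latex
\begin{align*}
\text{$s$-learning:}\quad a_i(t+1) &= (1-\mu)\,a_i(t) + \mu\,[\lambda + (1-\lambda)\,\omega_i(t)]\,s_i(t) \\
a_i(t+1) = a_i(t) = a_i \;&\Rightarrow\; \mu\,a_i = \mu\,[\lambda + (1-\lambda)\,\omega_i]\,s_i
  \;\Rightarrow\; a_i = [\lambda + (1-\lambda)\,\omega_i]\,s_i \\[4pt]
\text{$\eta$-learning:}\quad \eta_i(t+1) &= (1-\mu)\,\eta_i(t) + \mu\,[\lambda + (1-\lambda)\,\omega_i(t)]\,\eta(t) \\
\eta_i = [\lambda + (1-\lambda)\,\omega_i]\,\eta \;&\Rightarrow\; a_i = x_i + \delta + j\,\eta_i
  = x_i + \delta + j\eta\,[\lambda + (1-\lambda)\,\omega_i]
\end{align*}
```

Setting $\omega_i = 1$ gives $a_i = s_i$ for buyers in both cases; setting $\omega_i = 0$ gives $\lambda s_i$ and $x_i + \delta + j\lambda\eta$ respectively.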

With weighted belief $s$-learning, the agents always estimate the right sign of the attraction
independently of whether they are buyers or non-buyers, so that the system converges to the
theoretical Nash equilibrium despite the incorrect estimations by non-buyers. This is not true
for reinforcement learning ($\lambda = 0$), because in this case the attractions of non-buyers converge
to $a_i = \lambda s_i = 0$. As a result, at equilibrium we expect fewer buyers than with weighted belief
learning, because with reinforcement learning individuals that have initial negative attractions
persist in non-buying even if they could obtain positive payoffs.

With $\eta$-learning, as with $s$-learning, buyers' asymptotic attractions converge to the actual
surpluses both with weighted belief and with reinforcement learning: $a_i = s_i$. Non-buyers'
surplus estimations converge to $a_i = s_i - (1 - \lambda)j\eta$, which may be negative even if $s_i > 0$.
Therefore, in contrast with $s$-learning, weighted belief $\eta$-learning may fail to reach the
theoretical Nash equilibria. With reinforcement learning, on the other hand, $\eta$-learning may
perform better than $s$-learning, since non-buyers' estimations converge to $a_i = \delta + x_i$,
i.e., they disregard the social component of the surplus but take into account correctly their
idiosyncratic preferences. Therefore, the fraction of buyers increases with $\delta$, without getting
stuck at a value determined only by the initial conditions, as happens with $s$-learning.
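
A short numerical illustration of the non-buyers' misestimation, with hypothetical values chosen only for the arithmetic (they are not taken from the simulations):

```latex
\begin{align*}
j = 4,\ \eta = 0.5,\ \lambda = 0.2,\ s_i = 1:\qquad
a_i &= s_i - (1-\lambda)\,j\,\eta = 1 - 0.8 \times 4 \times 0.5 = -0.6 < 0 \\
\lambda = 0:\qquad a_i &= s_i - j\eta = 1 - 2 = -1 = \delta + x_i
\end{align*}
```

In both lines the agent would obtain a positive surplus $s_i = 1$ by buying, but its estimate is negative, so it keeps not buying.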

The comparison of $s$- and $\eta$-learning with the same parameters shows that with weighted
belief learning, $\eta$-learning converges to fewer buyers than $s$-learning, because in the latter case
the sign of the surplus is correctly estimated. This is a rather counterintuitive result, since
individuals using $s$-learning have a poorer knowledge of the payoff structure. On the other hand,
with reinforcement learning, $\eta$-learning gets closer to the theoretical Nash equilibria because the
agents know their preferences, and only misestimate the fraction of buyers. To summarize, with
reinforcement learning the ranking of the equilibria reached with the two learning scenarios is
reversed with respect to that found with weighted belief learning. With $\eta$-learning, agents with a
priori knowledge of $x_i$ and $j$ drive the system through learning to states with larger fractions $\eta$
than with $s$-learning, where agents do not have this a priori information.

We only considered $0 \le \lambda \le 1$, implying that non-buyers may only underestimate the
learned quantity (be it the forgone payoff or the fraction of buyers). Values $\lambda > 1$ would allow
one to model the non-buyers' regret about their chosen strategy. Such values of $\lambda$ can only lead
to overestimations of the learned term, helping non-buyers to increase their attractions
for buying faster. The result would be an acceleration of convergence. Since buyers make correct
estimations, we expect that, except for reinforcement $s$-learning, the final states would be the same
as with fictitious play. With reinforcement $s$-learning, the results would be the same as those
presented here.

To conclude, our results show that systems with interacting rational agents with limited 
information may not reach the theoretical Nash equilibria, even when these are unique. If the 
social interactions are so strong that there are multiple Nash equilibria, the resulting collective 
state is very sensitive to the agents' initial guesses of the opportunity of buying. 

We restricted our simulations to systems where all the agents use the same learning rule. 
Further investigations should consider mixtures of different kinds of learners. 

Our agents used deterministic learning rules. One drawback is that their decisions are
independent of the magnitude of the attraction: only its sign matters. Probabilistic decision
rules, where the uncertainty of the choice is larger the closer the attraction is to 0, have been
studied in a related model where adaptive customers have to choose between different sellers
[37, 27], in a particular context where fictitious play is not possible. There, the existence of
multiple equilibria is shown to lead to a transition between an unfaithful and a faithful behaviour
(customers going to different sellers in the first case, and preferring one particular seller in
the other case). Within our general framework we have studied the adaptive dynamics with
probabilistic decision rules. A typical result is that the population reaches states in which
decisions fluctuate close to the average ones. This stationary regime is in general close to
the 'quantal response equilibrium' [24] described in economics. In addition, a more complex
stationary state can be obtained when the choice uncertainty is strong enough. A detailed
analysis of the collective behaviour under such probabilistic decision rules will be presented
elsewhere [25].
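
For concreteness, one common way to write a probabilistic decision rule of this kind is the logit form sketched below. The functional form and the parameter name `beta` are assumptions made for illustration; the text does not specify which rule was used.

```python
import math

def buy_probability(attraction, beta):
    """Logit choice rule: probability of buying given an attraction.

    beta >= 0 is a hypothetical 'inverse noise' parameter: beta = 0 gives a
    coin flip, and large beta approaches the deterministic sign rule.
    """
    z = beta * attraction
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)                 # numerically stable branch for z < 0
    return e / (1.0 + e)

# The choice is most uncertain when the attraction is close to 0.
for a in (-2.0, -0.1, 0.0, 0.1, 2.0):
    print(f"attraction = {a:+.1f}  P(buy) = {buy_probability(a, beta=3.0):.3f}")
```

With this form the probability of buying is exactly 1/2 at zero attraction and moves towards 0 or 1 as the magnitude of the attraction grows, which is the qualitative property described above.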